ABSTRACT



The persistence of life requires populations to adapt at a rate commensurate with the 
dynamics of their environment. Successful populations that inhabit highly variable 
environments have evolved mechanisms to increase the likelihood of successful adap- 
tation. We introduce a 64 x 64 matrix to quantify base-specific mutation potential, 
analyzing four different replicative systems, error-prone PCR, mouse antibodies, a 
nematode, and Drosophila. Mutational tendencies are correlated with the structural 
evolution of proteins. In systems under strong selective pressure, mutational biases 
are shown to favor the adaptive search of space, either by base mutation or by recom- 
bination. Such adaptability is discussed within the context of the genetic code at the 
levels of replication and codon usage. 
1 Introduction



Viable mutations can potentiate the emergence of new life forms and the adaptation of liv- 
ing organisms to new environmental constraints. Evolution occurs through a hierarchy of genetic 
events, including base substitution, homologous recombination, insertions, deletions, rearrangements,
transpositions, and horizontal transfers (Lawrence, 1997; Pennisi, 1998). Systems such as
the adaptive immune system use somatic hypermutation to rapidly search protein space to combat 
infectious agents. Likewise, error-prone PCR is used in molecular evolution protocols to search 
space in order to optimize protein function. In addition, pathogens and cancers have evolved effec- 
tive dynamic mechanisms, often predicated on base substitution, to evade immune and therapeutic 
selection. In HIV, for example, the high rate of viral mutation makes the development of a vac- 
cine difficult and results in the rapid onset of resistance to many current drugs. Indeed, there is a 
correspondence between the ability of HIV to evolve drug resistance, the drug regimen given, and 
the genetic makeup of the strains present in a patient (Lathrop and Pazzani, 1999). Crucial to a
thorough understanding of the base substitution process is a mathematically precise quantification 
of the various mutation rates. 

The mutational machinery of hypermutation and recombination is under environment-dependent 
regulation (Bull et al., 2001). Studies have shown that regulation is possible both in the process
of replication and error correction (Sutton and Walker, 2001) and in the type of polymerases
expressed (Friedberg et al., 2000; Storb, 2001). The mechanism for maintenance of adaptability
traits is population-based and requires a dynamic environment. That evolvability is a group
selectable trait has been shown in simulations of digital organisms (Travis and Travis, 2002;
Peper, 2003; Ofria et al., 1999; Thearling and Ray, 1997; Wagner and Altenberg, 1996;
Altenberg, 1994). Many of the biochemical events necessary to modify adaptability are known.
At the simplest level, mutation of a single amino acid site in the Taq Pol I enzyme is sufficient
to greatly modulate the accuracy of DNA replication (Patel et al., 2001).



The goal of this paper is to show how species can use evolution in populations to gain a
space-searching advantage in the context of the genetic code. Base-to-base rates of synonymous,
conservative, and non-conservative mutation tendencies for each codon are described, thus allowing
for the quantification of evolutionary potential. Base-specific mutation rates are dependent upon the
fidelity of the replication machinery, flanking sequences, and other environmental conditions. The
base substitution rate is non-uniform because transitions are typically greatly favored over
transversions, and purines are typically substituted at a greater frequency than pyrimidines. However,
different replication systems have different base-specific mutation probabilities. It is here argued that,
within the context of the genetic code, the emergence of replication variants that modulate not only the
overall rate but also base-specific mutation rates allows populations to increase the probability of
searching productive survival space under dynamic environmental constraints. Our theory complements
previous observations for the immune system (Kepler, 1997), recent observations of codon
bias toward increased adaptability in Influenza A (Plotkin and Dushoff, 2003), and recent work on
digital organisms showing how adaptability evolves within a population (Travis and Travis, 2002;
Peper, 2003; Ofria et al., 1999; Thearling and Ray, 1997; Wagner and Altenberg, 1996;
Altenberg, 1994).



A codon mutation matrix that defines in a precise manner the probability per round of a codon 
mutating by base substitution into another codon is introduced. This matrix provides the rates of all 
possible 64 x 64 mutations. From these detailed rates, properties of the codons themselves can be 
calculated. For example, the codon mutation matrix allows the classification of codons according 
to their synonymous, conservative, or non-conservative mutabilities. Codons that tend to mutate 
in a more dramatic, non-conservative fashion are characterized as having a higher evolutionary 
potential, allowing for a more rapid short-term adaptation. 

To describe mutabilities of codons, there currently exists the $K_S$ and $K_A$ notation (Li et al., 1985).
The parameter $K_S$ describes the number of synonymous substitutions per site, and $K_A$ describes
the number of non-synonymous substitutions per site. Because of its average character, and because
it is based upon sequences that have undergone selection, the $K_S$ and $K_A$ description is limited to
estimating the number of synonymous and non-synonymous nucleotide substitutions between exons
of homologous genes.

Some approaches that exist are indirect measures of intrinsic adaptability at the genetic level.
The PAM and BLOSUM matrices, for example, describe mutabilities between amino acids rather
than between codons (Dayhoff et al., 1978; Henikoff and Henikoff, 1992; Durbin et al., 1998).
Moreover, a matrix of pure mutational tendencies is ideally constructed from non-selected
genomic data, such as intron regions or pseudo-genes. Yang and Kumar have developed
what is known as the Q matrix (Yang and Kumar, 1996). This matrix quantifies the underlying
mutational pattern of nucleotide substitution. This 4x4 matrix, which deals with bases rather than
codons, will be useful in our development. A codon mutation matrix based upon the assumption
that the ratio of transition to transversion mutation rates is constant and that the ratio of
nonsynonymous to synonymous mutation rates is constant has been developed (Goldman and Yang, 1994).
This matrix can capture mutational data that are consistent with the assumption of equal transition
to transversion and nonsynonymous to synonymous mutation rates. Our 64 x 64 matrix separates
the species-specific mutation probabilities, and it additionally allows us to quantify the efficacy,
type, and biases of subsequent codon mutation changes in the context of the genetic code.

In the context of pathogen and disease evolution, the mutation matrix can be a valuable
tool to quantify mutation probabilities and to enable the design of therapeutics and vaccines that
would most effectively target disease epitopes that have the lowest chance of evolutionary escape
(Freire, 2002). In the context of laboratory evolution of proteins, or protein molecular evolution
(Patten et al., 1997; Lutz and Benkovic, 2000; Petrounia and Arnold, 2000), knowing the tendencies
of codons to mutate synonymously, conservatively, or non-conservatively would be helpful in
experiment design.



2 Methods 

2.1 The Codon Mutation Matrix 

As an approximation, it is initially assumed that each base in a codon mutates independently. 
This allows the 64 x 64 codon mutation matrix to be constructed from the 4x4 base mutation 
matrix. In particular 

$$T_{ij} = t_{i_1 j_1} \, t_{i_2 j_2} \, t_{i_3 j_3} , \qquad (1)$$

where $i$ is the number of the codon that will be mutated and $j$ is the number of the codon that results
after the mutation, with $1 \le i, j \le 64$. Codon $i$ is denoted by $i_1 i_2 i_3$, where $i_1$ is the first base in
codon $i$, $i_2$ is the second base, and $i_3$ is the third base, with $1 \le i_1, i_2, i_3 \le 4$. Similarly, $j_1 j_2 j_3$ is
the base triplet for codon $j$. The probability of a mutation from codon $i$ to codon $j$ in one round
of replication is given by the codon mutation matrix $T_{ij}$. In this mathematical representation, the
probability per round of no mutation is given by $T_{ii}$. Since either a mutation occurs or no mutation
occurs, this matrix satisfies the constraint

$$\sum_{j=1}^{64} T_{ij} = 1 \quad \text{for all } i . \qquad (2)$$

The probability per round of a mutation from one base to another is given by the base substitution
matrix $t$. The base mutation matrix also satisfies conservation of probability, $\sum_{j_1=1}^{4} t_{i_1 j_1} = 1$. This
definition leads to what is known mathematically as a discrete-time Markov process. The base
mutation matrix t can be constructed from information about the mutation frequency for the four 
bases, A, C, G, and T. The non-diagonal elements of t are derived from the 12 different independent
rates of mutation. Typically the non-diagonal elements are small, since the rate of mutation is on 
the order of 10^^ — 10^^ per base per replication. The diagonal elements of the base mutation 
matrix are computed from the conservation of probability constraint. The 64 x 64 codon mutation 
matrix is then constructed from the 4x4 base mutation matrix by Equation 1. Each element of the
64 x 64 matrix thus gives the probability per round of one codon mutating to another codon. One
round, or codon mutation step, can include zero, one, two, or three simultaneous base mutations.
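
The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base ordering, the function names, and the numerical rates are hypothetical and are not the measured values used in this work.

```python
from itertools import product

BASES = "ACGT"  # base order is an arbitrary choice for this sketch
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # the 64 codons


def base_matrix(off_diagonal):
    """Build the 4x4 matrix t from the 12 off-diagonal rates.

    off_diagonal maps (from_base, to_base) -> probability per round.
    The diagonal follows from conservation of probability.
    """
    t = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                t[x][y] = off_diagonal[(x, y)]
        t[x][x] = 1.0 - sum(t[x].values())
    return t


def codon_matrix(t):
    """Equation 1: T[i][j] = t[i1][j1] * t[i2][j2] * t[i3][j3]."""
    return {
        ci: {cj: t[ci[0]][cj[0]] * t[ci[1]][cj[1]] * t[ci[2]][cj[2]]
             for cj in CODONS}
        for ci in CODONS
    }


# Hypothetical rates: transitions (A<->G, C<->T) five times transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
rates = {(x, y): (5e-4 if (x, y) in TRANSITIONS else 1e-4)
         for x in BASES for y in BASES if x != y}

t = base_matrix(rates)
T = codon_matrix(t)
row_sums = [sum(T[ci].values()) for ci in CODONS]  # Equation 2: each is 1
```

Because each base is treated independently, every row of the codon matrix sums to the cube of the corresponding base-matrix row sum, so Equation 2 holds automatically once the base matrix conserves probability.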

The assumption that DNA bases mutate independently can be refined in the presence of ad- 
ditional experimental data. It is known, for example, that flanking bases affect the base mutation 
rate in the hypervariable region of mouse antibodies (Smith et al., 1996). Overall mutation rates
have been measured for base triplets, and this information can be used to refine the codon mutation
matrix. If $\omega_i$ is the observed mutation rate for codon $i$, the improved codon mutation matrix $T'$ is
defined as 

$$T'_{ij} = z \, \omega_i \, T_{ij} , \quad i \ne j , \qquad (3)$$

where $z$ is a constant chosen so that the average mutation rate of the codons remains unchanged
by this operation: $z = \sum_{i \ne j} T_{ij} / \sum_{i \ne j} \omega_i T_{ij}$. Alternatively, the assumption of equal transition to
transversion and synonymous to nonsynonymous mutation rates may be used to generate a refined
codon mutation matrix (Goldman and Yang, 1994), although this will not be done in the present
work.
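
A minimal sketch of this refinement follows. It assumes a particular reading of Equation 3, namely that the off-diagonal entries of row i are rescaled by the observed codon rate and a common constant z, with the diagonal then reset so that each row still sums to one. The function name, the three-state toy matrix, and the rate values are hypothetical.

```python
def refine(T, omega):
    """Rescale off-diagonal entries of row i by z * omega[i].

    ASSUMPTION: this is our reading of Equation 3; z is chosen so that the
    summed off-diagonal probability is unchanged by the rescaling.
    """
    states = list(T)
    total = sum(T[i][j] for i in states for j in states if i != j)
    weighted = sum(omega[i] * T[i][j]
                   for i in states for j in states if i != j)
    z = total / weighted
    refined = {}
    for i in states:
        row = {j: z * omega[i] * T[i][j] for j in states if j != i}
        row[i] = 1.0 - sum(row.values())  # keep each row summing to 1
        refined[i] = row
    return refined


# Toy 3-state matrix standing in for the 64 codons (values are made up).
T = {
    "x": {"x": 0.98, "y": 0.01, "z": 0.01},
    "y": {"x": 0.02, "y": 0.97, "z": 0.01},
    "z": {"x": 0.01, "y": 0.01, "z": 0.98},
}
omega = {"x": 1.0, "y": 3.0, "z": 0.5}
T_refined = refine(T, omega)
```

In this toy case the state with the largest observed rate gains off-diagonal probability while the total off-diagonal probability of the matrix is preserved.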

The codon mutation matrix differs from organism to organism and is constructed here for 
several specific systems. Since comparative trends are of interest, the overall average base mutation 
rate is set to be the same in all species, 2 x 10^^ per replication. A different average mutation rate 
for each species would simply adjust the overall scale of the codon mutation matrix. In each case, 
the base mutation matrix is first constructed from available data, and then Equation 1 is used to
construct the full codon mutation matrix. 

The 64 x 64 codon mutation matrix contains a total of 4096 elements, each element calculated
from Equation 1 or Equation 3. For each codon a synonymous, conservative, and non-conservative
mutability is defined. The synonymous mutability, for example, is the sum of all of the elements of 
the codon mutation matrix that change a codon by a synonymous mutation. Similarly, the conser- 
vative mutability is the sum of all of the elements of the codon mutation matrix that change a codon 
by a conservative mutation. A conservative mutation occurs when a codon mutates to a codon that 
codes for a different amino acid that is, however, similar to the amino acid originally encoded. 
Amino acids are similar if they are in the same group, and there are seven groups: neutral and po- 
lar, positive and polar, negative and polar, nonpolar with ring, nonpolar without ring, cysteine, and 
stop. Substitutions that change the amino acid to a different group are defined as non-conservative, 
and substitutions that retain the encoded amino acid are defined as synonymous. Finally, the non- 
conservative mutability is the sum of all of the elements of the codon mutation matrix that change 
a codon by a non-conservative mutation. These three mutability values express the probability that 
a specific codon will mutate synonymously, conservatively, or non-conservatively in one round of 
replication.
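
The three mutabilities can be computed as in the following sketch. The standard genetic code table is used; the assignment of individual amino acids to the seven groups is an assumption made here for illustration, since only the group names are given above, and the uniform base matrix with rate `mu` is a made-up example input.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))

# ASSUMPTION: the membership of each group below is our own choice.
GROUPS = {
    "neutral polar": "STNQ",
    "positive polar": "KRH",
    "negative polar": "DE",
    "nonpolar with ring": "FWYP",
    "nonpolar without ring": "GAVLIM",
    "cysteine": "C",
    "stop": "*",
}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}


def classify(ci, cj):
    """Label the change from codon ci to codon cj."""
    a, b = CODE[ci], CODE[cj]
    if a == b:
        return "synonymous"
    if GROUP_OF[a] == GROUP_OF[b]:
        return "conservative"
    return "non-conservative"


def mutabilities(T):
    """Sum the codon matrix elements of each type, codon by codon."""
    result = {}
    for ci in CODONS:
        sums = {"synonymous": 0.0, "conservative": 0.0,
                "non-conservative": 0.0}
        for cj in CODONS:
            if cj != ci:
                sums[classify(ci, cj)] += T[ci][cj]
        result[ci] = sums
    return result


# Illustrative codon matrix from a uniform base matrix (mu is made up).
mu = 3e-4
t = {x: {y: (1.0 - mu if x == y else mu / 3) for y in BASES} for x in BASES}
T = {ci: {cj: t[ci[0]][cj[0]] * t[ci[1]][cj[1]] * t[ci[2]][cj[2]]
          for cj in CODONS} for ci in CODONS}
M = mutabilities(T)
```

For every codon the three values add up to the total probability of leaving that codon in one round, which gives a quick consistency check on any codon matrix.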



2.2 Systems Studied 

The mutation frequencies of the Taq polymerase in error-prone PCR are available and can be
extracted (Moore and Maranas, 2000). In the context of protein molecular evolution (Patten et al., 1997;
Lutz and Benkovic, 2000; Petrounia and Arnold, 2000), understanding the mutational process in
error-prone PCR is especially important. The base mutation matrix for this, and the other systems,
is available in the Supplementary Information. The three mutability values for each codon for this
system are shown in Figure 1a.

The codon mutation matrix is also constructed for mutations in the intronic V regions of mouse
antibodies (Smith et al., 1996). Equation 3 is used to account for the effect of flanking bases in the
mutation process, using JH/Jk intronic data (Shapiro et al., 1999). The mutability values for this
system are shown in Figure 1b.

The data from non-long terminal repeat retrotransposable elements are used to construct the 
4x4 base mutation matrix for Drosophila (Petrov and Hartl, 1999). Only the data from the terminal
branches, representing "dead-on-arrival," nonfunctional copies that are unconstrained by
selection, were used. These copies evolve as pseudo-genes.

The last system for which a codon mutation matrix is constructed is mitochondrial DNA from 
Haemonchus contortus (Blouin et al., 1998). This is a nematode in the same subclass, Rhabditia,
as Caenorhabditis elegans. Coding regions of mtDNA were used to allow for comparison with
codon usage data available in the literature. The base mutation matrix obtained from these data
is treated as applicable to nuclear DNA, and so the standard genetic code is used. While use 
of intronic data from C. elegans would be preferable, such data are difficult to collect due to the 
extensive divergence between C. elegans and its near relative, C. briggsae (T. Blumenthal, personal 
communication, 2001). The mutation rate data estimated by the mtDNA mutation rates will not 
play an essential role in the analysis. 

2.3 The No-Bias Codon Mutabilities 

We are looking for biases in the underlying mutation rates of the replication machinery, not
for biases in the genetic code itself. The genetic code biases, that hydrophobic residues tend
to mutate to hydrophobic residues and that hydrophilic residues tend to mutate to hydrophilic
residues, are well known (Woese, 1965; Epstein, 1966; Goldberg and Wittes, 1966; Fitch, 1966;
Volkenstein, 1994). To investigate biases other than those induced by the genetic code, a refinement
to the codon mutability plots is made. This refinement subtracts from each mutability a
value termed the "no-bias" value. The "no-bias" value comes from a 64 x 64 matrix that is
created by using a 4 x 4 matrix where each non-diagonal term has equal mutation frequencies, e.g.
equal transition and transversion rates. In other words, the no-bias plots indicate which empirically
derived mutabilities are above or below those expected if all base substitutions were equally
likely. This matrix serves as a baseline for unbiased mutation rates within the context of the
genetic code. This no-bias transformation is not a correction: it is a refined way to do the analysis.
The overall mutation rate of the no-bias codon mutation matrix is made to be the same as that of the
original codon mutation matrix. Synonymous, conservative, and non-conservative mutabilities are
calculated from this baseline 64 x 64 matrix and subtracted from the original mutabilities (Figure 2).
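
The no-bias subtraction can be illustrated with the sketch below. Two simplifications are assumed: the overall rate is matched at the level of the mean per-base mutation rate, and the quantity compared is the total probability that a codon changes, used here as a stand-in for the three separate mutabilities. The biased base matrix is made up.

```python
BASES = "ACGT"


def mean_rate(t):
    """Average probability per round that a base mutates."""
    return sum(1.0 - t[x][x] for x in BASES) / len(BASES)


def no_bias_matrix(t):
    """Uniform off-diagonal matrix with the same mean per-base rate as t."""
    mu = mean_rate(t)
    return {x: {y: (1.0 - mu if x == y else mu / 3.0) for y in BASES}
            for x in BASES}


def codon_mutation_probability(t, codon):
    """Probability per round that at least one base of the codon changes."""
    p_unchanged = 1.0
    for b in codon:
        p_unchanged *= t[b][b]
    return 1.0 - p_unchanged


# Made-up biased matrix: A and G mutate more readily than C and T.
off = {"A": 4e-4, "G": 4e-4, "C": 1e-4, "T": 1e-4}
t = {x: {y: (1.0 - 3 * off[x] if x == y else off[x]) for y in BASES}
     for x in BASES}
baseline = no_bias_matrix(t)


def relative(codon):
    """Empirical value minus the no-bias value for one codon."""
    return (codon_mutation_probability(t, codon)
            - codon_mutation_probability(baseline, codon))
```

With these made-up rates, `relative("AAA")` is positive and `relative("CCC")` is negative, which is the kind of above-baseline or below-baseline signal the no-bias plots are meant to expose.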

3 Results 

3.1 Modulation of Codon Mutation Rates 

Error-prone PCR, while not a pure biological system, is a central tool and serves as an excellent
example of the power of our approach. Figure 2a immediately reveals that for error-prone PCR,
the codons that code for polar amino acids have low relative conservative and non-conservative
mutabilities. That is, these mutabilities are much lower than what would be expected under unbiased
conditions. For the codons that code for the nonpolar amino acids, on the other hand, a different
pattern is observed. In this case, the conservative and non-conservative mutabilities are higher than
the baseline values generated from equal mutation rates. Note that because of the factorization in
Equation 1, our theory describes the biasing effect of base mutations, and the "reading frame" of
Taq does not matter. In Figures 1a and 2a we are showing the effect of these biased base mutations
when the ribosome reads the exons in frame.

To study the possible effects of mutability modulation in a natural population undergoing
rapid, active evolution, the mouse V regions are examined with the 64 x 64 mutation matrix
approach. Interestingly, higher conservative and non-conservative mutabilities are observed for the
polar amino acids compared to the nonpolar amino acids, Figure 2b. We quantify the statistical
significance of these results by computing the probability that a random base mutation
matrix would lead to a ratio of mutation rates between the polar groups and the nonpolar groups that
is as great or greater than that observed. That is, we take the ratio of the sum of the conservative and
non-conservative mutabilities from Figure 2 for these two groups. The probability by chance that
this ratio is as large or larger than that in Figure 2b is 8.6%. From this extremely conservative
statistic, it can be concluded that the pattern of increased mutability of polar amino acids is statistically
significant to the level of 91%. We also perform this same calculation using another, independent
estimate of the base mutation matrix for mouse V regions (Neuberger and Milstein, 1995). The
probability by chance that the ratio of conservative and non-conservative mutabilities for a random
matrix is larger than that given by this new matrix is 5.3%. This result is, thus, significant to the
level of 95%. It is interesting to note that if one assumed the experimentally measured base mutation
matrices were random, i.e. dominated by experimental noise, the probability that two random
such matrices would give a ratio as large or larger than that observed in Figure 2b is $0.086^2 = 0.7\%$.

It is difficult to measure exact mutation rates experimentally. Thus, the sensitivity of the
codon mutation matrix to changes in the base substitution rates is of interest. In order to test the
robustness of our findings for Taq to experimental noise, a random number is added to or subtracted
from each of the twelve off-diagonal, independent values in the 4x4 base mutation matrix. This
random number is generated from a Gaussian distribution with zero mean and a standard deviation
that is equal to a given percentage of the average mutation rate. This procedure generates a new
4x4 base mutation matrix, from which a new 64 x 64 codon mutation matrix is calculated. To
determine if the mutability bias patterns found in Figure 2a are perturbed by the addition of noise,
codon mutability plots are created with the new codon mutation matrix. This plot displays the
pattern observed in Figure 2a until the noise overwhelms the signal. The pattern from Figure 2a
is still evident up to noise levels of 50% of the average mutation rate, disappearing only when
the noise reaches 60% (Tan, 2002). An analogous calculation was performed for the mouse V
region system, and again the pattern in Figure 2b persisted up to noise levels of 50% of the average
mutation rate, disappearing only when the noise reaches 60% (Tan, 2002). Thus, the observed
trends in Figure 2 are rather robust to the presence of experimental noise.
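
The perturbation step of this robustness test can be sketched as follows. Two details are assumptions of the sketch rather than statements about the original calculation: the "average mutation rate" is taken as the mean of the twelve off-diagonal entries, and any perturbed rate that falls below zero is clipped to zero. The starting matrix is made up.

```python
import random

BASES = "ACGT"


def perturb(t, percent, rng):
    """Add zero-mean Gaussian noise to the 12 off-diagonal entries of t.

    ASSUMPTIONS: sigma is `percent` of the mean off-diagonal entry, and
    negative perturbed rates are clipped to zero.
    """
    off = [t[x][y] for x in BASES for y in BASES if x != y]
    sigma = (percent / 100.0) * (sum(off) / len(off))
    noisy = {}
    for x in BASES:
        row = {y: max(0.0, t[x][y] + rng.gauss(0.0, sigma))
               for y in BASES if y != x}
        row[x] = 1.0 - sum(row.values())  # diagonal from conservation
        noisy[x] = row
    return noisy


rate = 1e-4  # made-up uniform rate for the example
t = {x: {y: (1.0 - 3 * rate if x == y else rate) for y in BASES}
     for x in BASES}
rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = perturb(t, 50, rng)
```

Each perturbed base matrix would then be passed through Equation 1 to give a new codon matrix, and the mutability plots recomputed at increasing noise levels.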

One might wonder whether this pattern of increased non-synonymous mutabilities of charged
residues would survive in other mouse or mammalian genes. Figure 3 shows the no-bias codon
mutability plot derived from non-immune-system gene mutation rates from human B cells. Data
are from (Shen et al., 2000). As expected, there is no overall pattern. A quantitative comparison to
the polar to nonpolar ratio of conservative and non-conservative mutabilities calculated for Figure
2b shows that in this case the probability that a random base mutation matrix has a value higher
than that observed in Figure 3 is 25%. Thus, the increase in the non-synonymous mutability of the
mammalian immunoglobulin V region in Figure 2b is unique and statistically significant.

Further analysis of the codon mutation matrices was done by combining mutability information
with codon usage information. Codon usage is necessary to determine via the mutation matrix
the average rate of mutation of a gene, since the total rate of mutation depends on both the mutation
rate per codon and on which codons are present. By summing the product of the RSCU value
(Sharp et al., 1986) and the synonymous mutability for all the codons that code for a given amino
acid, the synonymous mutability of amino acid $a$ is calculated:

$$\text{synonymous mutability}(a) = \sum_{i \in a} \text{RSCU}(i) \times \text{synonymous mutability}(i) , \qquad (4)$$

where the synonymous mutability of codon $i$ is taken from Figure 1, and the codon usage
is taken from the experimental RSCU values (Duret and Mouchiroud, 1999). The synonymous
mutability of amino acids is observed to be higher in the short genes than in the long genes for
Drosophila, Figure 4. Indeed, of the amino acids, only arginine has a demonstrably lower
synonymous mutability for the short genes, as seen in Figure 4. We calculate the probability that
the observed increase in synonymous mutability is due to chance. The probability of 17 or more
amino acids showing this trend out of 18 by chance is
$[\binom{18}{18} + \binom{18}{17}] \, 2^{-18} = 7.2 \times 10^{-5}$. Making
the same plot for the nematode, one observes the pattern to be even more striking (Tan, 2002;
Blouin et al., 1998) (data not shown). Indeed, of the amino acids, only proline has a demonstrably
lower synonymous mutability for the short genes, and only two other amino acids have roughly the
same synonymous mutability in short and long genes. The probability of 15 or more amino acids
showing this trend out of 16 by chance is
$[\binom{16}{16} + \binom{16}{15}] \, 2^{-16} = 2.6 \times 10^{-4}$.
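
The two chance probabilities quoted above are upper tails of a binomial distribution in which each amino acid is equally likely to show or not show the trend. A short check, with a function name of our own choosing:

```python
from math import comb


def tail_probability(n, k):
    """P(X >= k) when X ~ Binomial(n, 1/2)."""
    return sum(comb(n, m) for m in range(k, n + 1)) / 2 ** n


p_first = tail_probability(18, 17)   # (1 + 18) / 2**18
p_second = tail_probability(16, 15)  # (1 + 16) / 2**16
```

These evaluate to about 7.2e-5 and 2.6e-4, respectively. The calculation treats the amino acids as independent trials, which is the assumption implicit in the formula.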

While there are selective pressures on synonymous codon usage, such as preference for tRNAs 
at different levels of abundance, it seems unlikely that there would be a selection on the quantity 
synonymous mutation rate, in and of itself, that is significant enough to cause the observed corre- 
lation. In other words, there are known to be selective pressures on codon usage. What is not clear 
is why there should be selective pressure on synonymous mutation rate itself. There is selection 
pressure on the ability to adapt, however. In order for short genes to evolve at an overall rate com- 
parable to that of long genes, the mutation rate per base would have to be higher in short genes. 
If one assumes that on average there is a certain number of mutations needed to effect functional 
adaptation of a protein, and that short proteins and long proteins need to evolve at roughly sim- 
ilar rates, this then implies that short proteins need a higher per base rate of evolution than long 
proteins, because they are shorter and the evolution rate of a gene is the evolution rate per base
times the number of bases. Thus, the evolution rate per base must be higher for shorter proteins.
In contrast to Figure 4, however, a correlation between conservative or non-conservative mutation
rate and gene length was not observed for either Drosophila or the nematode (data not shown). 

3.2 Modulation of Recombination Rates



An alternative means of evolution is recombination, and recombination rates are known to
be correlated with codon usage bias (Comeron et al., 1999). Selection pressure on short genes for
greater evolvability could favor a higher recombination rate per base, thus allowing short genes to
evolve at a rate comparable to that of long genes. It would be unfavorable if evolution for higher
recombination rates led to lower conservative or non-conservative mutation rates. C+G content is
known to be a rough measure of recombination rate (Eyre-Walker, 1993; Comeron and Kreitman, 2000;
Duret et al., 2000; Birdsell, 2002). In other words, the correlation between C+G content and
recombination rate is strong enough that C+G content is now felt to be a useful marker of local
recombination rate (Fullerton et al., 2001; Birdsell, 2002). Interestingly, we find that C+G content is
positively correlated with all three mutation rates and is most highly correlated with synonymous
mutation rate. Moreover, as Figure 5 shows, the codon usage of short genes is such that a higher per
base rate of estimated recombination is favored. The recombination rate of amino acid $a$ is estimated
by

$$\text{estimated recombination rate}(a) = \sum_{i \in a} \text{RSCU}(i) \times (\text{number of C or G bases in codon } i) , \qquad (5)$$

where the codon usage is taken from the experimental RSCU values (Duret and Mouchiroud, 1999).
In Figure 5a, only one exception, for proline, is found to the general pattern. As Figure 5b shows, a
similar correlation between codon usage and enhanced estimated recombination frequency is also
observed in Drosophila. No exceptions to the general pattern are found in Figure 5b. Finally,
Figure 5c shows the estimated recombination rate for A. thaliana. In Figure 5c, only one exception,
for glycine, is found to the general pattern. Considering all three species, the probability of 52 or
more amino acids showing this trend out of 54 by chance is
$[\binom{54}{54} + \binom{54}{53} + \binom{54}{52}] \, 2^{-54} = 8.2 \times 10^{-14}$.
The pattern is, thus, highly statistically significant. One explanation for the observed codon usage
of short, high-expression genes is selective pressure on crossover frequency. On a long time scale,
other factors such as neutral evolution and rearrangements become important, and this is likely the
reason for the relatively modest shifts in the codon usage observed in Figure 5.
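
As an illustration of the estimate in Equation 5, the sketch below weights the C+G count of each synonymous codon by its RSCU value, following the statement that codon usage enters through the RSCU values. The function names are ours, and the RSCU numbers for glycine are made up purely for the example.

```python
def gc_count(codon):
    """Number of C or G bases in a codon."""
    return sum(base in "CG" for base in codon)


def estimated_recombination_rate(rscu):
    """Sum over one amino acid's codons of RSCU * (C+G count)."""
    return sum(weight * gc_count(codon) for codon, weight in rscu.items())


# Made-up RSCU values for the four glycine codons. By definition the RSCU
# values of an amino acid with four synonymous codons sum to 4.
short_genes = {"GGT": 0.6, "GGC": 1.6, "GGA": 1.0, "GGG": 0.8}
long_genes = {"GGT": 1.0, "GGC": 1.0, "GGA": 1.0, "GGG": 1.0}

estimate_short = estimated_recombination_rate(short_genes)
estimate_long = estimated_recombination_rate(long_genes)
```

In this hypothetical case the short-gene usage shifts weight toward the C- and G-ending codons, so `estimate_short` exceeds `estimate_long`. The same weighted sum, with synonymous mutability in place of the C+G count, gives the quantity in Equation 4.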

In Figure 6a is shown the measured recombination rate versus protein length for genes in
Drosophila at high expression levels (Hey and Kliman, 2002) (EST > 50). In this species, codon
bias is observable for genes at all recombination levels. The correlation between codon bias and
recombination rate is seen, however, only when the latter is low (Hey and Kliman, 2002;
Marais and Piganeau, 2002). Figure 6 is, therefore, made only for recombination rates less than 1
centimorgan per megabase. A negative correlation between recombination rate and protein length
is observed. In Figure 6c, the measured recombination rate versus protein length is shown for C.
elegans for genes at high expression levels (Marais and Piganeau, 2002). A clear negative correlation
between recombination rate and gene length is again observed.



4 Discussion 

4.1 Selective Pressures on Codon Mutation Rates 

It was found that for the Taq polymerase, nonpolar amino acids are mutated at an elevated rate. 
Nonpolar amino acids are more frequently present in the interior cores of proteins, and mutations of 
these amino acids more often lead to dramatic rearrangements of the protein structure. The pattern 
in the error-prone PCR mutation plot suggests that the mutations that occur will tend to cause larger 
changes in the structure of the encoded protein. It is becoming more accepted that large mutation 
events such as transpositions, horizontal transfers, gene exchange, and non-conservative mutations 
are necessary for dramatic evolution. This was shown quantitatively in (Bogarad and Deem, 1999).
Non-conservative mutations in the core of the protein would be one of the most dramatic amino 
acid substitution moves possible and can be considered to search the protein sequence space most 
broadly. In other words, under error-prone conditions, the Taq polymerase favors codons for the 
nonpolar amino acids that mutate non-conservatively. This property of error-prone PCR greatly 
enhances the ability of this method to improve protein function effectively by forcing the search of 
greater regions of tertiary fold space. Moreover, the average mutational tendencies of Taq can be 
modulated by codon usage. Table 1 defines codons by their tendencies to evolve under error-prone
conditions. These data can be useful in the design of protein evolution experiments, especially 
when trying to evolve new motifs ab initio. 

It was found that for V regions of mouse antibodies there is an increase in the mutation rate
of the charged amino acids. These trends are not sensitive to whether Equation 1 or Equation 3 is
used to model the mutation matrix or whether the mutation data are taken from (Smith et al., 1996;
Shapiro et al., 1999) or from (Neuberger and Milstein, 1995). Antibody V regions undergo DNA
swapping of gene fragments in order to create the primary repertoire needed to develop resistance 
to disease. Therefore, base mutations that alter the framework of the proteins become less nec- 
essary. More significant are mutations that lead to a greater binding affinity. In protein-protein 
complexes, a positive correlation is observed between binding affinity and the number of ionic 
interactions spanning an interface (Sheinerman et al., 2000; Xu et al., 1997). Thus for the polar
amino acids participating in binding, high conservative and non-conservative mutabilities would 
be most favorable, since such characteristics would enable more efficient searching of sequence 
space to optimize binding. 

4.2 Selective Pressures on Recombination Rates



Previously, a correlation between codon usage bias and gene length had been observed in the
species considered here (Duret and Mouchiroud, 1999). Several mechanisms that might explain
the increased codon bias in short genes were considered, including biased tRNA levels, but all
predicted increased bias for longer genes, in contrast to the greater observed bias for shorter genes
(Duret and Mouchiroud, 1999). We suggest that codon usage in short genes in these species has
evolved due to selection for increased recombination frequency, Figure 5. This mechanism is
consistent with previously observed positive correlations between recombination rate and codon usage
bias and with previously observed negative correlations between gene length and codon usage bias
(Comeron et al., 1999; Comeron and Kreitman, 2000). The observed correlation between codon
usage and synonymous mutation rate, Figure 4, may be a byproduct of selection on recombination
rate, as synonymous mutation rate is positively correlated with C+G content (R = 0.62 for
Drosophila, and R = 0.51 for the nematode).

In (Duret and Mouchiroud, 1999), the codon usage bias was highest for those genes at high
expression levels, and Figure 5 is based upon those data. In fact, the expression level was estimated
in (Duret and Mouchiroud, 1999) from the frequency with which those genes were observed in the
EST database. It is possible that certain genes may be overrepresented in the EST database, in a
way that is correlated with the gene length. If this unknown bias were the cause of the correlation
in Figure 5, then the opposite or no correlation would be expected to be observed for genes at
low expression. In fact (data not shown), the same patterns observed in Figure 5 are observed
when codon usage for the genes at low (bottom 1/3 of genes with non-zero EST abundance) rather
than high (top 1/3 of genes with non-zero EST abundance) expression levels is used: among the
54 amino acids, only three have lower estimated recombination rates for the short genes at low
expression levels than for the long genes at low expression levels.

It might be argued that to be fully consistent with our theory, the relevant recombination rate is
that of the whole gene, divided by the coding length of the gene. This quantity is slightly different
from the quantity plotted in Figure 6a because the intron to exon composition of genes could vary
systematically with length. This concern has been addressed in Figure 6b, where recombination
rate times gene length divided by coding length has been plotted. The same negative correlation
between recombination rate and gene length is again observed.
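
As a rough illustration of the adjustment just described, the following Python sketch rescales a measured recombination rate by gene length over coding length and then computes a Pearson correlation coefficient against protein length. The function names and the sample values are hypothetical and chosen only for illustration; they are not the data analyzed here.

```python
import math

def adjusted_recombination_rate(rate_cm_per_mb, gene_length, coding_length):
    """Recombination rate times gene length divided by coding length."""
    return rate_cm_per_mb * gene_length / coding_length

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical genes: (protein length in amino acids, rate in cM/Mb,
# gene length in bases, coding length in bases). Illustrative values only.
genes = [
    (200, 3.0, 1200, 600),
    (400, 2.5, 2400, 1200),
    (800, 1.5, 6000, 2400),
    (1200, 1.0, 9000, 3600),
]
protein_lengths = [g[0] for g in genes]
adjusted = [adjusted_recombination_rate(r, gl, cl) for _, r, gl, cl in genes]
r_value = pearson_r(protein_lengths, adjusted)
```

With these made-up values the adjusted rate still falls as protein length grows, so `r_value` is negative; real data would of course need to be substituted before drawing any conclusion.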

For our explanation to be consistent, it must be the case that Drosophila and C. elegans are, in 
some sense, mutationally starved. The very existence of the Hill-Robertson effect in these species 
(Marais and Piganeau, 2002) implies that this is the case, because it implies that point mutation
is insufficient to evolve linked genes, and that recombination is necessary to break the linkage. 
The existence of related effects, such as interference selection (Comeron and Kreitman, 2002),
provides additional evidence for the same reasons. Finally, the fact that codon bias is observable
only for genes at low recombination rates in Drosophila, less than 1 (Marais and Piganeau, 2002)
or 1.5 (Hey and Kliman, 2002) cM/Mb, provides additional indirect evidence that the selective
pressure to increase evolution rates is strongest where evolution is the slowest.



5 Conclusion 

Previous treatments of the evolutionary biology of codon usage have largely ignored the pos- 
sibility that codon usage could affect mutation or recombination rates and have primarily focused 
on using codon usage as a measure of selection. We here suggest that not only can codon usage 
affect mutation and recombination rates but also codon usage has been selected to enhance func- 
tional gene adaptation within the context of the genetic code. This line of reasoning is in accord 
with strategies for optimized design of experimental protein molecular evolution protocols, where 
speed of evolution is an explicit goal (Bogarad and Deem, 1999; Moore and Maranas, 2002).

In Nature there are numerous examples of exploiting codon potentials in ongoing evolutionary
processes. In the V regions of encoded antibodies, high-potential serine codons such as AGC
are found predominantly in the encoded CDR loops while the encoded frameworks contain
low-potential serine codons such as TCT (Wagner et al., 1995). Unfortunately, antibodies and drugs
are often no match for the hydrophilic, high-potential codons of "error-prone" pathogens. The
dramatic mutability of the HIV gp120 coat protein is one such example. One can envision a scheme
for using codon potentials to target disease epitopes that mutate rarely (i.e., low-potential) and
unproductively (i.e., become stop, low-potential, or structure-breaking codons). Such a therapeutic
scheme should be generally useful against diseases that use error-prone replication to escape
therapeutic treatments or vaccines.
Table 1: Table of codon classifications for the error-prone PCR system.

synonymous conservative non-conservative

Cys    TGC, TGT
Ser    TCA, TCT    TCC, TCG    AGC, AGT
Thr    ACA    ACC, ACG, ACT
Pro    CCA, CCC, CCT    CCG
Ala    GCA, GCT    GCC, GCG
Gly    GGA, GGT    GGC, GGG
Asn    AAC, AAT
Gln    CAA, CAG
Asp    GAC, GAT
Glu    GAA, GAG
His    CAC, CAT
Arg    CGA, CGT    AGA, AGG, CGC
       CGG
Lys    AAA, AAG
Met    ATG
Ile    ATA    ATC, ATT
Leu    CTA    CTC, CTG, CTT
       TTA, TTG
Val    GTA, GTC, GTG
       GTT
Phe    TTC, TTT
Tyr    TAC, TAT
Trp    TGG
Stop   TAA, TAG, TGA
Figure 1: The codon mutability plot for a) error-prone PCR and b) V regions of mouse antibodies.
Each plot displays the synonymous, conservative, and non-conservative mutabilities for each
codon.
Figure 2: The no-bias plot for a) error-prone PCR and b) V regions of mouse antibodies. This
refinement to the codon mutability plots takes into account the baseline substitution rate due to the 
inherent structure in the genetic code. 




Figure 3: No-bias plot for the non-immune-system genes c-Myc, survivin1, survivin2, and TBP in
human B cells. 
Figure 4: Synonymous mutabilities for Drosophila for amino acids in short (< 333 amino acids)
and long (> 570 amino acids) genes at high expression levels (top 1/3 of genes with non-zero EST 
abundance). Higher values of synonymous mutability are observed in the shorter genes. 
Figure 5: Estimated recombination frequency for a) C. elegans, b) D. melanogaster, and c) A.
thaliana for amino acids in short (< 333 amino acids) and long (> 570 amino acids) genes at high 
expression levels (top 1/3 of genes with non-zero EST abundance). Higher values of estimated 
recombination frequency are observed in the shorter genes. Recombination frequency is estimated 
by the sum over all codons encoding a given amino acid of the observed codon usage times the 
number of C and G bases in the codon. 
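
The estimate described in this caption can be sketched in a few lines of Python. This is only a minimal illustration: it assumes that "observed codon usage" means the relative frequency of each codon among the codons for one amino acid, and the function names and glycine usage values are hypothetical rather than taken from the data.

```python
def gc_count(codon):
    """Number of C and G bases in a codon."""
    return sum(1 for base in codon.upper() if base in "CG")

def estimated_recombination_frequency(codon_usage):
    """Sum over codons of (codon usage) x (number of C and G bases).

    codon_usage maps each codon of ONE amino acid to its usage
    (assumed here to be a relative frequency).
    """
    return sum(usage * gc_count(codon) for codon, usage in codon_usage.items())

# Hypothetical usage of the four glycine codons in two gene sets.
gly_short_genes = {"GGA": 0.20, "GGC": 0.40, "GGG": 0.10, "GGT": 0.30}
gly_long_genes = {"GGA": 0.30, "GGC": 0.25, "GGG": 0.10, "GGT": 0.35}

short_estimate = estimated_recombination_frequency(gly_short_genes)
long_estimate = estimated_recombination_frequency(gly_long_genes)
```

In this invented example the short-gene set leans toward the C- and G-rich codons, so its estimate is the larger of the two.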
Figure 6: Measured recombination frequency (centimorgan/megabase) as a function of protein
length (amino acids) for a) D. melanogaster, b) D. melanogaster, where recombination frequency
is modified to account for intron to exon base composition, R × (gene length) / (coding length), and
c) C. elegans. Also shown are linear fits to the data; the correlation coefficients are a) R = -0.32,
b) R = -0.20, and c) R = -0.89. All data are for genes at high expression levels. Data in a)
and b) are taken from (Hey and Kliman, 2002). Data in c) are replotted from the binned data of
(Marais and Piganeau, 2002).
Supplementary Information



Table 2: The base mutation matrix t for error-prone PCR (Moore and Maranas, 2000).

      A             C             G             T
A     1-7.4x10^-6   7.0x10^-7     3.2x10^-6     3.5x10^-6
C     7.0x10^-7     1-2.8x10^-6   0             2.1x10^-6
G     2.1x10^-6     0             1-2.8x10^-6   7.0x10^-7
T     3.5x10^-6     3.2x10^-6     7.0x10^-7     1-7.4x10^-6



Table 3: The base mutation matrix t for V regions of mouse antibodies (Smith et al., 1996;
Shapiro et al., 1999).(a)

      A               C               G               T
A     1-3.327x10^-5   6.550x10^-6     1.847x10^-5     8.250x10^-6
C     4.580x10^-6     1-2.597x10^-5   4.370x10^-6     1.702x10^-5
G     1.329x10^-5     6.200x10^-6     1-2.407x10^-5   4.580x10^-6
T     4.360x10^-6     7.270x10^-6     3.960x10^-6     1-1.559x10^-5

(a) For example, t_AC = 6.550x10^-6.
Table 4: The base mutation matrix t for Drosophila (Petrov and Hartl, 1999). Only the
relative rates are significant.

      A             C             G             T
A     1-4.0x10^-5   1.0x10^-5     1.6x10^-5     1.4x10^-5
C     1.8x10^-5     1-6.0x10^-5   1.2x10^-5     3.0x10^-5
G     3.0x10^-5     1.2x10^-5     1-6.0x10^-5   1.8x10^-5
T     1.4x10^-5     1.6x10^-5     1.0x10^-5     1-4.0x10^-5
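
To make the link between a 4 x 4 base mutation matrix and a 64 x 64 codon matrix concrete, here is a minimal Python sketch using the Drosophila values of Table 4. It rests on assumptions not stated in this section: that rows index the original base and columns the new base, that the three codon positions mutate independently, and that a "synonymous mutability" can be taken as the total probability of reaching a different codon for the same amino acid. All names (`T`, `CODON_MATRIX`, `synonymous_mutability`) are hypothetical.

```python
from itertools import product

# Base mutation matrix for Drosophila (Table 4). Assumption: T[x][y] is the
# relative probability that base x is replaced by base y; rows sum to one.
T = {
    "A": {"A": 1 - 4.0e-5, "C": 1.0e-5, "G": 1.6e-5, "T": 1.4e-5},
    "C": {"A": 1.8e-5, "C": 1 - 6.0e-5, "G": 1.2e-5, "T": 3.0e-5},
    "G": {"A": 3.0e-5, "C": 1.2e-5, "G": 1 - 6.0e-5, "T": 1.8e-5},
    "T": {"A": 1.4e-5, "C": 1.6e-5, "G": 1.0e-5, "T": 1 - 4.0e-5},
}

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop).
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)
}
CODONS = sorted(GENETIC_CODE)

def codon_transition(src, dst):
    """Probability of src -> dst, assuming independent base positions."""
    p = 1.0
    for x, y in zip(src, dst):
        p *= T[x][y]
    return p

# Assumed 64 x 64 codon matrix built from the 4 x 4 base matrix.
CODON_MATRIX = {s: {d: codon_transition(s, d) for d in CODONS} for s in CODONS}

def synonymous_mutability(codon):
    """Total probability of reaching a different codon for the same amino acid."""
    aa = GENETIC_CODE[codon]
    return sum(p for d, p in CODON_MATRIX[codon].items()
               if d != codon and GENETIC_CODE[d] == aa)
```

Under these assumptions a codon such as ATG, which has no synonymous alternative, gets a synonymous mutability of zero, while GGA inherits essentially the full mutation rate of its third-position A. Conservative and non-conservative mutabilities would additionally need an amino acid similarity classification, which this sketch does not attempt.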



Table 5: The base mutation matrix for human, non-immune system B-cell genes
(Shen et al., 2000). Only the relative rates are significant. Data were generated from
counts of observed mutations by summing the formula $N_{\text{observed mutations } x \to y} =
N_{\text{clones}} N_{\text{bases}} P(x) P(x \to y \mid x)$ over all genes examined, and averaging the mutation matrix
so obtained over the two donors.

      A              C              G              T
A     1-1.86x10^-5   0.0x10^-5      1.4x10^-5      0.46x10^-5
C     2.2x10^-5      1-5.73x10^-5   0.33x10^-5     3.2x10^-5
G     5.9x10^-5      0.0x10^-5      1-7.5x10^-5    1.6x10^-5
T     0.71x10^-5     2.3x10^-5      0.71x10^-5     1-3.72x10^-5
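
One possible reading of the procedure in the caption of Table 5 is sketched below in Python: the count formula, observed mutations = (clones) x (bases) x P(x) x P(x -> y | x), is summed over genes on both sides and solved for P(x -> y | x), and the result is averaged over donors. This interpretation, the function names, and all counts are assumptions made for illustration; they are not the data of Shen et al.

```python
def conditional_rate(genes, x, y):
    """Estimate P(x -> y | x) by summing the count formula over genes."""
    observed = sum(g["mutations"].get((x, y), 0) for g in genes)
    # Expected number of x bases examined, summed over genes.
    exposure = sum(g["n_clones"] * g["n_bases"] * g["base_freq"][x] for g in genes)
    return observed / exposure

def donor_average(rates):
    """Average an estimated rate over donors."""
    return sum(rates) / len(rates)

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical data for two donors (illustrative numbers only).
donor_1 = [
    {"n_clones": 50, "n_bases": 1000, "base_freq": UNIFORM,
     "mutations": {("G", "A"): 2}},
    {"n_clones": 30, "n_bases": 1000, "base_freq": UNIFORM,
     "mutations": {("G", "A"): 1}},
]
donor_2 = [
    {"n_clones": 40, "n_bases": 500, "base_freq": UNIFORM,
     "mutations": {("G", "A"): 1}},
]

rate_g_to_a = donor_average(
    [conditional_rate(d, "G", "A") for d in (donor_1, donor_2)]
)
```

Pooling counts before dividing weights each gene by how many bases were actually examined, which seems a natural choice when individual genes carry only a handful of mutations.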



Table 6: The base mutation matrix t for mitochondrial DNA in a nematode
(Blouin et al., 1998).

      A              C              G              T
A     1-7.60x10^-6   4.00x10^-7     6.90x10^-6     3.00x10^-7
C     2.20x10^-7     1-2.41x10^-6   1.00x10^-7     2.09x10^-6
G     3.02x10^-6     8.00x10^-8     1-3.15x10^-6   5.00x10^-8
T     3.00x10^-7     4.00x10^-6     1.00x10^-7     1-4.40x10^-6
Caption for Cover Figure



Emerging patterns from the inherent structure of the genetic code and non-uniform mutation 
rates in error prone replication. The relative rate of codon mutation above baseline (blue) is shown 
by color intensity. Non-baseline synonymous changes are green; conservative, orange; and non- 
conservative, red. The codons are ordered by AAX, CAX, GAX, TAX, ACX, 